Section: New Results

Combinatorics and Annotation

Word counting and random generation

The team has pursued long-term research on word enumeration, with the aim of computing the statistical significance of a pattern occurrence under a given background model. As part of E. Furletova’s thesis, defended in February 2012 and co-advised by M. Roytberg (Impb, Puschino, Russia) and M. Régnier, an extension to Hidden Markov Models, SufPref, has been proposed. It relies on a new concept of overlap graphs that efficiently overcomes the main difficulty in computing probabilities, namely overlapping occurrences. An implementation is available at http://server2.lpm.org.ru/bio/online/sf/ . This algorithm provides a significant space improvement over a previous algorithm, AhoPro, developed with our former associate team Migec. Word statistics were used to identify mRNA targets for miRNAs involved in carcinogenesis [13].
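To make the role of overlapping occurrences concrete, here is a minimal sketch of the textbook automaton-based computation of the probability that a single word occurs at least once in a random text. It is not SufPref or AhoPro: it assumes an i.i.d. (Bernoulli) background model rather than a Hidden Markov Model, and the function names `prefix_automaton` and `occurrence_probability` are illustrative only.

```python
def prefix_automaton(pattern, alphabet):
    """Transition table of the DFA whose state is the length of the longest
    prefix of `pattern` that is a suffix of the text read so far."""
    m = len(pattern)
    delta = []
    for state in range(m):
        row = {}
        for c in alphabet:
            candidate = pattern[:state] + c
            k = len(candidate)
            # Shrink k until the suffix of length k is a prefix of the pattern;
            # this is where self-overlaps of the pattern are accounted for.
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def occurrence_probability(pattern, n, probs):
    """P(pattern occurs at least once in a text of n i.i.d. letters).
    `probs` maps each letter to its probability (assumed to sum to 1)."""
    m = len(pattern)
    delta = prefix_automaton(pattern, list(probs))
    # dist[s]: probability of being in state s with no occurrence seen so far
    dist = [0.0] * m
    dist[0] = 1.0
    hit = 0.0
    for _ in range(n):
        new = [0.0] * m
        for s, p in enumerate(dist):
            if p == 0.0:
                continue
            for c, pc in probs.items():
                t = delta[s][c]
                if t == m:
                    hit += p * pc      # first occurrence completed: absorb
                else:
                    new[t] += p * pc
        dist = new
    return hit


uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(occurrence_probability("AA", 3, uniform_dna))
```

The p-value of observing a pattern in a sequence of length n can then be read off directly under this simplified model; handling sets of words or Markovian backgrounds requires a larger state space, which is the cost that more refined approaches aim to reduce.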

Large deviation results that take advantage of general combinatorial properties of words have been derived in [41]. First, an approximation is derived for the double-strand counting problem, that is, counting the occurrences of a given pattern in a set of sequences that arise from both strands of the genome. Here, dependencies between a sequence and its complement play a fundamental role. Second, sets of short sequences with non-identical distributions are addressed. A possible application is the search for cis-acting elements in regulatory sequences that are known, for example from ChIP-chip or ChIP-Seq experiments, to be under similar regulatory control.

In [21], we developed a new algorithm for generating words of any regular language L uniformly at random. When floating-point arithmetic is used, its bit complexity is O(q log n) in space and O(qn log n) in time, where n is the length of the word and q is the number of states of a deterministic finite automaton recognizing L. We implemented the algorithm and compared its behavior with state-of-the-art algorithms on a set of large automata from the VLTS benchmark suite. Both theoretical and experimental results show that our algorithm offers an excellent compromise between space and time requirements compared with the best known alternatives. In particular, it is the only method that can generate long paths in large automata. Moreover, in [10], in collaboration with the Fortesse group at Lri, we presented several randomised algorithms for generating paths in large models according to a given coverage criterion. This work opens new perspectives for future studies of statistical testing and model checking, mainly to fight the combinatorial explosion problem.
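For reference, the classical recursive method for uniform generation from a deterministic automaton can be sketched as follows. This is the standard baseline, not the algorithm of [21]: it stores a full table of exact integer counts, so its space usage grows with both q and n, which is precisely the cost that more space-efficient methods avoid. The automaton encoding (a dictionary of dictionaries) and the function names are assumptions made for this illustration.

```python
import random


def count_table(delta, accepting, n):
    """counts[k][q] = number of words of length k leading from state q
    to an accepting state. `delta[q]` maps letters to successor states."""
    states = list(delta)
    counts = [{q: int(q in accepting) for q in states}]
    for _ in range(n):
        prev = counts[-1]
        counts.append({q: sum(prev[r] for r in delta[q].values())
                       for q in states})
    return counts


def random_word(delta, initial, accepting, n, rng=random):
    """Draw a word of length n uniformly among those accepted from `initial`."""
    counts = count_table(delta, accepting, n)
    if counts[n][initial] == 0:
        raise ValueError("the language contains no word of this length")
    word, q = [], initial
    for remaining in range(n, 0, -1):
        # Choose the next letter with probability proportional to the
        # number of accepted completions it leaves.
        r = rng.randrange(counts[remaining][q])
        for letter, target in delta[q].items():
            c = counts[remaining - 1][target]
            if r < c:
                word.append(letter)
                q = target
                break
            r -= c
    return "".join(word)


# Hypothetical example: binary words with no two consecutive 1s.
no_11 = {0: {"0": 0, "1": 1}, 1: {"0": 0}}
print(random_word(no_11, 0, {0, 1}, 10))
```

Each letter is chosen proportionally to the number of ways the word can still be completed, which is what makes the output uniform over all accepted words of length n.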

Analysis and design of weighted combinatorial models

Weighted context-free grammars are natural, yet powerful, random models for biological sequences and structures. We furthered our developments on these objects and applied them to the study of the Boltzmann ensemble of low-energy structures in RNA.

In collaboration with P. Clote (Boston College), we used analytic combinatorics to establish that the average geometric distance between the terminal ends of an RNA sequence, once folded, is asymptotically constant [8].

Furthermore, in collaboration with C. Banderier, O. Bodini and H. Tafat (Lipn), we constructively showed that any predefined distribution of patterns can be attained by a (possibly ambiguous) regular expression. We also designed a dynamic-programming algorithm to automatically build such models, adopting a segmentation approach based on a parsimony principle. This work was presented at the Analco'12 conference [30].

Finally, we continued a joint study of collisions in weighted random generation with D. Gardy and J. Du Boisberranger (Prism, Université de Versailles-St Quentin). When performing random generation within large collections of weighted objects, the probability of any sample can be computed exactly and efficiently. Therefore, any redundancy in the sampled set is uninformative (in contrast with situations where the probability is itself estimated by the sampling procedure). Following previous results presented at Gascom'10 (Montreal), we presented at the Aofa'12 conference (Montreal, Canada) [33] a new closed-form formula for the waiting time of the coupon collector problem, i.e. the average number of words that one must draw to obtain the full collection. The framework defined here has direct applications in the context of RNA: approaches based on sampling are preferred to deterministic optimizations, and the algorithmic efficiency of the methods can be critically affected by the redundancy of sampled sets.
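As a point of comparison, the expected waiting time of the coupon collector problem with non-uniform probabilities can be computed for very small collections with the classical inclusion-exclusion identity E[T] = Σ over non-empty subsets S of (-1)^(|S|+1) / p(S), where p(S) is the total probability of S. The sketch below implements that textbook identity, not the formula of [33]; it is exponential in the number of objects, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations


def expected_full_collection_time(weights):
    """Expected number of independent draws needed to see every object at
    least once, each object being drawn proportionally to its weight."""
    ws = [Fraction(w) for w in weights]
    total = sum(ws)
    probs = [w / total for w in ws]
    expectation = Fraction(0)
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(probs, size):
            # 1 / p(S) is the expected time until some object of S is drawn.
            expectation += sign / sum(subset)
    return expectation


# Three equally likely objects: 3 * (1 + 1/2 + 1/3) = 11/2 draws on average.
print(expected_full_collection_time([1, 1, 1]))
# Skewed weights make the rarest object dominate the waiting time.
print(expected_full_collection_time([1, 3]))
```

The skewed case illustrates why redundancy matters in weighted sampling: most draws return objects already seen, so the full collection takes longer to obtain than in the uniform case.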

Scientific Workflows

Several scientific workflow systems have been designed to support users in designing, managing, monitoring, and executing in-silico experiments. Such systems are now equipped with provenance modules that collect the data produced and consumed during workflow runs in order to enhance reproducibility. In this context, we have worked in two directions. First, we have worked on the problem of reuse between scientific workflows. In particular, we have identified the presence of common or similar (sub-)workflows and workflow elements, and have studied in depth, for the first time in the literature, the problem of cross-author reuse [38].

Second, we have studied the structure of scientific workflows. More precisely, we have focused on series-parallel (SP) graph structures. Designing sub-workflows and querying or monitoring workflows require solving subgraph isomorphism problems. This problem is NP-complete for general DAGs but can be solved in polynomial time when graphs are restricted to SP graphs. We have designed and implemented the SPFlow algorithm, which rewrites any workflow into an SP workflow while ensuring that the provenance of the rewritten workflow is the same as that of the original [32], [39].
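To illustrate what an SP structure is, the following sketch tests whether a two-terminal DAG is series-parallel by repeatedly applying the classical series and parallel reductions until no more apply. It only recognizes SP graphs; it is not SPFlow and performs no rewriting. The edge-list encoding and the function name are assumptions of this example, and the input is assumed to be a DAG with a single source and a single sink.

```python
from collections import Counter


def is_series_parallel(edges, source, sink):
    """True if the two-terminal DAG given by `edges` (a list of (u, v) pairs)
    reduces to the single edge (source, sink)."""
    multi = Counter(edges)                 # (u, v) -> multiplicity
    changed = True
    while changed:
        changed = False
        # Parallel reduction: merge multiple edges between the same nodes.
        for e in list(multi):
            if multi[e] > 1:
                multi[e] = 1
                changed = True
        # Series reduction: bypass an inner node with one predecessor
        # and one successor.
        preds, succs = {}, {}
        for (u, v) in multi:
            succs.setdefault(u, []).append(v)
            preds.setdefault(v, []).append(u)
        for v in preds:
            if v == source or v == sink:
                continue
            if len(preds[v]) == 1 and len(succs.get(v, [])) == 1:
                u, w = preds[v][0], succs[v][0]
                del multi[(u, v)]
                del multi[(v, w)]
                multi[(u, w)] += 1
                changed = True
                break                      # adjacency is stale: rebuild it
    return set(multi) == {(source, sink)}


# A diamond-shaped workflow is SP ...
print(is_series_parallel([("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")], "s", "t"))
# ... but adding a cross edge a -> b yields a non-SP structure.
print(is_series_parallel([("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")], "s", "t"))
```

The second graph is the smallest kind of obstacle to an SP structure: neither inner node can be bypassed, so no reduction applies.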

We are currently working on identifying the reasons why some scientific workflows have a non-SP structure. Our long-term goal is to design a distilling procedure for scientific workflows that lets users naturally design workflows whose structure is close to an SP structure. This work is done in close collaboration with the University of Manchester [31].